A rank-predicted pseudo-greedy approach to efficient text selection from large-scale corpus for maximum coverage of target units

نویسندگان

  • Wei Li
  • Qiang Huo
چکیده

Selecting efficiently a minimum amount of text from a largescale text corpus to achieve a maximum coverage of certain units is an important problem in spoken language processing area. In this paper, the above text selection problem is first formulated as a maximum coverage problem with a Knapsack constraint (MCK). An efficient rank-predicted pseudo-greedy approach is then proposed to solve this problem. Experiments on a Chinese text selection task are conducted to verify the efficiency of the proposed approach. Experimental results show that our approach can improve significantly the text selection speed yet without sacrificing the coverage score compared with traditional greedy approach.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Building Text Corpus for Unit Selection Synthesis

The present paper deals with building the text corpus for unit selection text-to-speech synthesis. During synthesis the target and concatenation costs are calculated and these costs are usually based on the prosodic and acoustic features of sounds. If the cost calculation is moved to the phonological level, it is possible to simulate unit selection synthesis without any real recordings; in this...

متن کامل

Towards Optimal TTS Corpora

Unit selection text-to-speech systems currently produce very natural synthesized phrases by concatenating speech segments from a large database. Recently, increasing demand for designing high quality voices with less data has created need for further optimization of the textual corpus recorded by the speaker. This corpus is traditionally the result of a condensation process: sentences are selec...

متن کامل

Machine learning for text selection with expressive unit-selection voices

We show that a ranking model produced by machine learning outperforms two baselines when applied to the task of selecting texts for use in creating a unit-selection synthesis voice with good domain coverage. The model learns to predict the estimated utility of an utterance based on features relating it to the utterances selected so far and a corpus of target utterances. Our analyses indicate th...

متن کامل

Design of Speech Corpus for Open Domain Urdu Text to Speech System Using Greedy Algorithm

Unit selection speech synthesis is one of the most widely used techniques for high quality text to speech (TTS) systems. A unit selection text to speech system requires a large database of recorded and annotated speech, which contains both phonetic and prosodic variations. Designing phonetically rich and balanced speech corpora with minimum number of utterances is an intricate task. Several opt...

متن کامل

Material Development and English for Academic Purposes Word Lists; a Reductionist Approach

Nagy (1988) states that vocabulary is a prerequisite factor in comprehension. Drawing upon a reductionist approach and having in mind the prospects for material development, this study aimed at creating an English for Academic Purposes Word List (EAPWL). The corpus of this study was compiled from a corpus containing 6479 pages of texts, 2,081,678 million tokens (running words) and 63825 types (...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2008